Abstract 

Scientific progress is driven by the availability of information, which makes it essential that data be broadly, easily 
and rapidly accessible to researchers in every field. In addition to being good scientific practice, provision of 
supporting data in a convenient way increases experimental transparency and improves research efficiency by 
reducing unnecessary duplication of experiments. There are, however, serious constraints that limit extensive data 
dissemination. One such constraint is that, despite providing a major foundation of data to the advantage of the entire 
community, data producers rarely receive the credit they deserve for the substantial amount of time and effort they 
spend creating these resources. In this regard, a formal system that provides recognition for data producers would 
serve to incentivize them to share more of their data. 

The process of data citation, in which the data themselves are cited and referenced in journal articles as persistently 
identifiable bibliographic entities, is a potential way to properly acknowledge data output. The recent publication of 
several sorghum genomes in Genome Biology is a notable first example of good data citation practice in the field of 
genomics and demonstrates the practicalities and formatting required for doing so. It also illustrates how effective 
use of persistent identifiers can augment the submission of data to the current standard scientific repositories. 



Discussion 

One of the key lessons learned from the Human Genome 
Project, taking a page from the C. elegans community [1], 
was that making data broadly and freely available prior to 
publication was profoundly valuable to the field of gen- 
omics [2]. Subsequent genomics projects have tried to 
follow this practice as laid out in the Bermuda Rules [3] 
and ultimately enshrined in the Fort Lauderdale agree- 
ment [4]. The wider biological science community has 
also attempted to follow similar practices, as outlined in 
the guidelines published from the Toronto International 
Data Release Workshop [5], but adoption has been held 
back by a lack of easy-to-access repository infrastructure 
for many fields as well as an absence of incentives for 
authors to go through the time and effort necessary to 
make their work openly and easily available to others. 

The benefits of making data available to the research 
community as a whole can be calculated [6]: there is a 
measurable trend towards an author's work accumulating 
additional citations as a result of the supporting data 
being publicly accessible [7,8]. Again, however, the lack 
of a universally recognized tagging system linking inves- 
tigators to their deposited data has hindered authors 
from receiving due credit [9]. Recent scandals relating to 
falsified data that went long undetected in medicine [10] 
and psychology [11] also highlight the need to make data 
easily accessible for purposes of validation and to main- 
tain public trust in science. One notable attempt to ad- 
dress the issues of gaining complete access to data is the 
Dryad repository, which serves as a storehouse for smal- 
ler datasets directly affiliated with publications in the 
biosciences [12]. 

The next step forward in this regard is the recent publication 
in Genome Biology of the genomes of three strains of the 
important food crop Sorghum bicolor 
[13]. This work follows the best practices of the genomics 
community by having the supporting raw and useful pro- 
cessed data available in the relevant and available data re- 
positories. However, for the first time in the long- 
established data-sharing practices of the community, this 
process has been supplemented specifically by integrating 
into the reference section a citation for the collective data- 
set. Thus, in addition to having the raw data [SRA046843], 
assemblies [GenBank: AHAO00000000-AHAQ00000000], 
mutation [1056306], and structural variation data [nstd63] 
available in a number of NCBI databases, the data citation 
included in the reference section [14] makes available the 
same data, along with additional information, from a single 
point of access. In addition, and perhaps of even greater 
value, these data are persistently linked via a citable Data- 
Cite Digital Object Identifier (DOI), and are hosted on the 
GigaScience [15] GigaDB database [16]. 

GigaScience and GigaDB 

GigaScience is a journal and data publishing project set 
up in conjunction with BGI (formerly the Beijing Genomics 
Institute) [17], one of the world's largest genomics 
data producers, and BioMed Central [18]. Using BGI's 
large data storage and cloud computing infrastructure, 
GigaScience has created a novel publication format that 
integrates manuscript publication with data hosting. The 
data associated with articles are hosted in the connected 
GigaDB database and given DOIs to make them more 
searchable and trackable as well as independently citable. 

To demonstrate the utility of data citation and to pro- 
mote extremely rapid data release and dissemination of 
unpublished datasets independently of the journal, GigaDB 
has recently released large-scale data resources created by 
the BGI and its many external collaborations. Importantly, 
these data are the first cases in which whole genome-type 
data have been released with DOIs. GigaScience has been 
working with DataCite through the British Library to en- 
able these and future datasets to receive DOIs. 

DataCite 

Founded in December 2009, DataCite is an international 
partnership working towards a global citation framework 
for research data, with the aim of enabling researchers to 
find, access, and reuse datasets with confidence. Since its 
foundation, DataCite has been building a community to 
collaboratively develop services and good practices for 
data citation [19,20]. 

DataCite is initially leveraging the DOI system for re- 
search data. DOIs identify a resource, rather than the lo- 
cation of the resource, allowing the creation of persistent 
and stable references and offering an easy way to con- 
nect articles with their underlying data. The DOI system 
is governed by the International DOI Foundation (IDF), 
a non-profit organisation, with DataCite operating as a 
Registration Agency. This relationship provides DataCite 
and its member organisations [21] with rights and infra- 
structure to register DOIs. The crucial advantage of the 
DOI system over alternatives is that it is already familiar 
to researchers, publishers, and libraries. 

DataCite provides a service for trusted repositories to 
mint DOIs for datasets and, since August 2011, collects 
mandatory metadata on every DOI minted. Services are 
being developed around this open 'metadata store' to 
promote discovery and to facilitate access to original 
cited datasets. DataCite DOIs resolve to public pages de- 
scribing the datasets and providing a route for access. 
DataCite is also developing a content negotiation service 
that will allow metadata, and possibly even the datasets 
themselves, to be requested directly via DOIs [22]. 
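As a minimal sketch of what such a request might look like, the following Python fragment builds (but does not send) an HTTP request that asks a DOI resolver for metadata rather than the landing page. The media types listed and the helper name build_metadata_request are assumptions made for illustration; the formats actually offered depend on the deployed service.

```python
import urllib.request

# Media types assumed for illustration; the service may offer others.
MEDIA_TYPES = {
    "bibtex": "application/x-bibtex",
    "citation": "text/x-bibliography",
    "datacite_xml": "application/vnd.datacite.datacite+xml",
}


def build_metadata_request(doi, fmt="bibtex", resolver="http://dx.doi.org/"):
    """Return an unsent request asking the resolver for metadata in `fmt`.

    Content negotiation works through the HTTP Accept header: the URL is
    the ordinary DOI link, and only the requested representation changes.
    """
    if fmt not in MEDIA_TYPES:
        raise ValueError("unknown format: " + fmt)
    return urllib.request.Request(
        resolver + doi,
        headers={"Accept": MEDIA_TYPES[fmt]},
    )


# Example: ask for a BibTeX record for the sorghum dataset DOI.
request = build_metadata_request("10.5524/100012")
```

Passing the request to urllib.request.urlopen would send it; that step is left out here so that the sketch runs without network access.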

A cost is associated with managing persistence and 
with assigning identifiers, and so there is a cost associated 
with DOIs. In April 2011 the International DOI Founda- 
tion changed the charging model from a per-DOI cost to 
a cost-sharing model [23]. Essentially this means that repositories 
can pay a flat fee to a Registration Agency for 
the right to mint virtually unlimited DOIs. 

A brief history of data citation 

Environmental sciences researchers have been using the 
Pangaea [24] database to host data associated with their 
manuscripts for many years (for an example from 2005 
see [25]), and Dryad has more recently created a similar 
model with biomedical data [26]. These have been not- 
able successes in the movement for better data access. 
However, in cases where journals have included such 
datasets in their references, the datasets are often treated 
and formatted the same way as links on the web and, 
thus, are not listed by citation indices (such as article 
[25] and its associated dataset [27]). The long-established 
Protein Data Bank (PDB) biological macromolecular 
structure data archive [28] also uses DOIs, but other 
than rare exceptions [29,30], very few research articles 
have used DOIs to reference structures. Publishing data 
by wrapping and integrating it into the established jour- 
nal infrastructure is underway in a number of research 
areas such as the Earth Systems Science Data journal 
[31], and there have been attempts to semantically en- 
hance and integrate links with data via DOIs in biodiver- 
sity [32] and infectious disease research [33]. This has 
facilitated discovery and access to the underlying data, 
but these examples have neither used nor been citable 
through current journal indexing services. 

As DOIs issued for GigaDB datasets have been asso- 
ciated with and published alongside journal manuscripts, 
the GigaDB project appears to be the first time that gen- 
omic datasets have been released prior to manuscript 
publication in this citable DOI form. Although there have 
been public calls [2] and journal editorials [9] encouraging 
such a system, the practicalities and consequences of re- 
leasing data in a citable form before the publication of 
their associated manuscripts have been unclear, especially 
with widely varying journal editorial policies regarding 
pre-publication dissemination of results. Relevant to this 
is a commonly acknowledged editorial guideline from the 
New England Journal of Medicine that outlines limitations 
on prepublication release of information known as the 
"Ingelfinger rule" [34]. It effectively states that a manuscript 
may not be considered for publication if its substance 
has been submitted or reported elsewhere, and it has made 
many researchers wary of publicizing preliminary data. 
However, there are a number of ambiguities as to how this 
restriction is reconciled with the biological, and in particu- 
lar the genomic, community's code of practice regarding 
pre-publication data deposition in public databases. 

An interesting test case was the first dataset to be 
given a data citation by BGI [35]. This high-profile dataset, 
the first publicly available genome of the E. coli 
O104:H4 pathogen responsible for the 2011 European 
outbreak, was released prior to the publication of an 
associated article [36]. Researchers at the BGI collabo- 
rated with the University Medical Centre Hamburg- 
Eppendorf to rapidly sequence the genome of the patho- 
gen. Due to the seriousness of the situation (with 50 
human deaths and over 4000 people infected), it was 
clear that it was not in the public's best interest to hold 
back that information in order to follow standard re- 
search practices of first analysing the data and then wait- 
ing to release them until after acceptance of the resulting 
manuscript. Instead, the decision was made to immedi- 
ately release the dataset under the most open public do- 
main waiver, CC0 [37], to maximize its use by the 
community. By giving the dataset a DOI it was possible 
to not only enable the research community to cite and 
credit the authors, but also to mark the time of data re- 
lease, making this less commonly used route of data re- 
lease more attractive to the authors. 

Of greatest interest, perhaps, was that the rapid release 
of the E. coli genome data enabled an international com- 
munity of "crowdsourced" researchers to pool resources 
and carry out expeditious "open-source" analysis of the 
organism, a level of instantaneous collaboration that has 
not been seen before. This high-profile distributed prob- 
lem-solving approach substantially aided in limiting the 
health crisis, with strain-specific diagnostic primers dis- 
seminated within five days of the release of the sequence 
data (sequence available from the DOI landing page 
[35]), and the draft unassembled genome sequence data 
subsequently enabled the development of a targeted bac- 
tericidal agent to kill the pathogen [38]. It also brought 
to light a potentially useful way of scientifically addres- 
sing similar outbreaks in the future. Additionally, results 
of these analyses were published in the New England 
Journal of Medicine a few months later, showing that 
data citation can complement the traditional forms of 
academic credit [36]. 

Many other unpublished datasets have been released in 
the GigaDB database with DataCite DOIs, and a number 
of these have subsequently been used in scholarly journal 
articles. DOIs for two BGI datasets [39,40] used in a Nature 
Biotechnology [41] article were listed in the article's 
Accession Codes section, but were not included in the 
reference section due to citation limits and policies treat- 
ing them as non-refereed sources of information such as 
websites. The authors of the Sorghum bicolor genome 
paper worked very closely with the editors of Genome 
Biology to ensure that it followed the best practice guide- 
lines and current recommendations regarding how best 
to cite data, and included the data citation in the refer- 
ence section [42]. Since the publication of this paper, 
there have been positive recent developments from pub- 
lishers such as Springer and Nature providing examples 
of data cited in this way [43,44]. 

Discoverability, accessibility, and preservation 

Discoverability and accessibility of data are separate 
issues and should be treated as such. There are times 
when it may be necessary to limit accessibility to a data- 
set, but this should not prevent it from being archived 
and made discoverable via open metadata and a persist- 
ent identifier such as a DOI. Equally, discoverability and 
long-term preservation of data can be dealt with as sep- 
arate issues. Repositories should be carefully selected to 
ensure a preservation plan is in place, but we must 
accept that some datasets will be lost over time, for ex- 
ample due to limited storage capacity. By ensuring that 
persistent identifier organizations are provided with 
open metadata, it should at least be possible to keep a 
record of a dataset's existence and provenance, however. 
DataCite, for example, collects metadata for all datasets 
that are allocated DataCite DOIs. In the event of a data- 
set becoming unavailable, the appropriate DOI can be 
updated to resolve to the associated metadata record. 
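The separation between an identifier and the location it resolves to can be illustrated with a toy resolver, sketched below in Python. This is not how the DOI system is implemented; the class names, the example.org URLs, and the methods move and retire are hypothetical and serve only to show that the identifier stays fixed while its target changes, falling back to the metadata record when the data are no longer available.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class DoiEntry:
    metadata_url: str         # metadata record, kept even if the data are lost
    data_url: Optional[str]   # current dataset location, None once unavailable


class ToyResolver:
    """Illustrative in-memory stand-in for a persistent identifier registry."""

    def __init__(self) -> None:
        self._entries: Dict[str, DoiEntry] = {}

    def register(self, doi: str, data_url: str, metadata_url: str) -> None:
        self._entries[doi] = DoiEntry(metadata_url, data_url)

    def move(self, doi: str, new_data_url: str) -> None:
        # The dataset changed location; the DOI itself is unchanged.
        self._entries[doi].data_url = new_data_url

    def retire(self, doi: str) -> None:
        # The dataset is gone; only the metadata record remains.
        self._entries[doi].data_url = None

    def resolve(self, doi: str) -> str:
        entry = self._entries[doi]
        if entry.data_url is not None:
            return entry.data_url
        return entry.metadata_url


# Hypothetical DOI and URLs, for illustration only.
resolver = ToyResolver()
resolver.register(
    "10.9999/example",
    data_url="http://repo.example.org/datasets/1",
    metadata_url="http://registry.example.org/metadata/10.9999/example",
)
```

A citation that uses the identifier therefore remains valid through both a relocation (move) and a loss of the data (retire).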

How to cite data 

Given the importance of data in promoting research, and 
the needs of data producers to gain credit for their work 
in the same manner that researchers using these data are 
recognized, datasets should be cited as research articles 
are cited. Further to this, even though many journals 
currently tend to remove URLs from the reference list, 
both DataCite and CrossRef recommend displaying DOIs 
within references as full URLs. They consider this best 
practice because it emphasises the actionable link and 
allows readers to readily access the underlying data. The 
DOI in URL form not only serves the same function as a 
journal volume, issue and page number do for a printed 
article, but also gives the combined advantages of linked 
access and the assurance of persistence, the lack of 
which has in the past been part of the reason many journals 
have been reluctant to cite plain vanilla URLs. An ex- 
ample of what can be considered a new gold standard 
for data citation is the way in which the data that under- 
pin the recently published sorghum paper [14] were cited 
in the reference section, as follows: 
Zheng, L-Y; Guo, X-S; He, B; Sun, L-J; Peng, Y; 
Dong, S-S; Liu, T-F; Jiang, S; Ramachandran, S; Liu, 
C-M; Jing, H-C (2011): Genome data from sweet and 
grain sorghum (Sorghum bicolor). GigaScience. 
http://dx.doi.org/10.5524/100012 

It is important to note that the DOI will always point 
to the version of the data that were used for the study 
in which they were cited, enabling other researchers to 
use them for validation and comparative studies with 
confidence. 
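The citation pattern shown above (creators, year, title, publisher, DOI as a full URL) is regular enough to be generated from a metadata record. The Python sketch below is one possible way of doing so; the DatasetRecord class and the function names are illustrative assumptions, not part of any DataCite or journal tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetRecord:
    creators: List[str]   # each formatted as "Surname, Initials"
    year: int
    title: str
    publisher: str
    doi: str              # bare DOI, "doi:" form, or resolver URL


def doi_url(doi: str, resolver: str = "http://dx.doi.org/") -> str:
    """Return the DOI as a full, actionable URL."""
    doi = doi.strip()
    # Strip common prefixes so any input form ends up as a bare DOI.
    for prefix in ("doi:", "http://dx.doi.org/", "https://doi.org/"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
    if not doi.startswith("10."):
        raise ValueError("not a DOI: " + doi)
    return resolver + doi


def format_data_citation(rec: DatasetRecord) -> str:
    """Format: Creators (Year): Title. Publisher. DOI-URL"""
    creators = "; ".join(rec.creators)
    return f"{creators} ({rec.year}): {rec.title}. {rec.publisher}. {doi_url(rec.doi)}"


sorghum = DatasetRecord(
    creators=[
        "Zheng, L-Y", "Guo, X-S", "He, B", "Sun, L-J", "Peng, Y",
        "Dong, S-S", "Liu, T-F", "Jiang, S", "Ramachandran, S",
        "Liu, C-M", "Jing, H-C",
    ],
    year=2011,
    title="Genome data from sweet and grain sorghum (Sorghum bicolor)",
    publisher="GigaScience",
    doi="doi:10.5524/100012",
)
print(format_data_citation(sorghum))
```

Keeping the DOI in URL form inside the formatted string means the reference stays actionable wherever it is rendered.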

Formally citing the dataset in this manner not only 
clearly identifies the dataset, but also paves the way for 
data discovery and citation tracking via existing bibliomet- 
ric services. These services have traditionally focused on 
DOIs in the reference section only, so including data cita- 
tions here can help such services to start using them more 
quickly as existing systems will require less modification. 

Aiding the adoption of data citation 

Many journals now include 'Data Accessibility' sections 
that provide information about the data used in the 
paper and explain where these data can be accessed. 
Journal editors are also starting to draw up new guide- 
lines for data citation, but these approaches still remain 
inconsistent. In this regard, it is crucial that high-profile 
journals take the lead in citing data in a manner that 
drives adoption of good practices and raises awareness of 
this issue to the broader research community. 

Journal editors and publishers who promote consistent 
and equitable means of citing data, as exemplified by the 
handlers of the sorghum paper [13], should be com- 
mended. Defining formal mechanisms for dataset cit- 
ation is essential for making datasets more readily 
tracked and easily accessed. Furthermore, it provides the 
only real means for data producers to obtain appropriate 
recognition for their work, promoting more rapid data 
release potentially prior to the much more time-consum- 
ing process of manuscript publication. It also gives rec- 
ognition and makes clear the role of the researchers 
investing the most effort in producing the dataset, who 
may not have received similar credit in an eventual more 
analysis-focused publication. As research is being carried 
out with ever increasing amounts of data, widespread 
data availability will serve to enhance scientific progress 
and provide greater public benefit from the investments 
made to create these sharable data resources. 